## Arman Afrasiyabi Université Laval

Arman.afrasiyabi.1@ulaval.ca

doi:10.1038/nature20101

#### Hybrid computing using a neural network with dynamic external memory

Alex Graves<sup>1</sup>\*, Greg Wayne<sup>1</sup>\*, Malcolm Reynolds<sup>1</sup>, Tim Harley<sup>1</sup>, Ivo Danihelka<sup>1</sup>, Agnieszka Grabska-Barwińska<sup>1</sup>, Sergio Gómez Colmenarejo<sup>1</sup>, Edward Grefenstette<sup>1</sup>, Tiago Ramalho<sup>1</sup>, John Agapiou<sup>1</sup>, Adrià Puigdomènech Badia<sup>1</sup>, Karl Moritz Hermann<sup>1</sup>, Yori Zwols<sup>1</sup>, Georg Ostrovski<sup>1</sup>, Adam Cain<sup>1</sup>, Helen King<sup>1</sup>, Christopher Summerfield<sup>1</sup>, Phil Blunsom<sup>1</sup>, Koray Kavukcuoglu1 & Demis Hassabis1

Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we demonstrate that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving blocks puzzle in which changing goals are specified by sequences of symbols. Taken together, our results demonstrate that DNCs have the capacity to solve complex, structured tasks that are inaccessible to neural networks without external read-write memory.

is performed by a processor, which can use an addressable memory to bring operands in and out of play. This confers two important benefits: the use of extensible storage to write new information and the ability to treat the contents of memory as variables. Variables are critical to algorithm generality: to perform the same procedure on one datum or another, an algorithm merely has to change the address it reads from. In contrast to computers, the computational and memory resources of artificial neural networks are mixed together in the network weights and neuron activity. This is a major liability: as the memory demands ically, nor easily learn algorithms that act independently of the values realized by the task variables.

are remarkably adept at sensory processing<sup>1</sup>, sequence learning<sup>2,3</sup> and  $r = \sum_{i=1}^{n} M[i, ] \mathbf{w}^{r}[i]$ , where the '' denotes all i = 1, ..., W. Similarly, reinforcement learning<sup>4</sup>, cognitive scientists and neuroscientists have the write operation uses a write weighting  $w^w$  to first erase with argued that neural networks are limited in their ability to represent an erase vector e, then add a write vector v:  $M[i,j] \leftarrow M[i,j]$ variables and data structures<sup>5-9</sup>, and to store data over long timescales  $(1 - \mathbf{w}^{\mathbf{w}}[i]\mathbf{e}[j]) + \mathbf{w}^{\mathbf{w}}[i]\mathbf{v}[j]$ . The functional units that determine and without interference<sup>10,11</sup>. We aim to combine the advantages of neural and computational processing by providing a neural network with the heads is illustrated in Fig. 1 and summarized below; see Methods read-write access to external memory. The access is narrowly focused, for a formal description. minimizing interference among memoranda and enabling long-term storage<sup>12,13</sup>. The whole system is differentiable, and can therefore be trained end-to-end with gradient descent, allowing the network to learn how to operate and organize the memory in a goal-directed manner.

Modern computers separate computation and memory. Computation RAM, then the network, referred to as the 'controller', is a differentiable CPU whose operations are learned with gradient descent. The DNC architecture differs from recent neural memory frameworks 14,15 in that the memory can be selectively written to as well as read, allowing iterative modification of memory content. An earlier form of DNC, the neural Turing machine 16, had a similar structure, but more limited memory access methods (see Methods for further discussion).

Whereas conventional computers use unique addresses to access memory contents, a DNC uses differentiable attention mechanisms<sup>2,16–18</sup> to define distributions over the N rows, or 'locations', of a task increase, these networks cannot allocate new storage dynamint he  $N \times W$  memory matrix M. These distributions, which we call weightings, represent the degree to which each location is involved in a read or write operation. The read vector r returned by a read weighting Although recent breakthroughs demonstrate that neural networks  $w^r$  over memory M is a weighted sum over the memory locations: apply the weightings are called read and write heads. The operation of

#### Interaction between the heads and the memory

The heads use three distinct forms of differentiable attention. The first is content lookup 16,17,19-21, in which a key vector emitted by the controller is compared to the content of each location in memory according to a similarity measure (here, cosine similarity). The similarity scores A DNC is a neural network coupled to an external memory matrix. determine a weighting that can be used by the read heads for asso-(The behaviour of the network is independent of the memory size as ciative recall<sup>19</sup> or by the write head to modify an existing vector in long as the memory is not filled to capacity, which is why we view the memory. Importantly, a key that only partially matches the content of memory as 'external'.) If the memory can be thought of as the DNC's a memory location can still be used to attend strongly to that location.

<sup>&</sup>lt;sup>1</sup>Google DeepMind, 5 New Street Square, London EC4A 3TW, UK.

<sup>\*</sup>These authors contributed equally to this work

## Differentiable Neural Computer (DNC; 2016)

Neural Turing Machine (NTM; 2014)



All parts are differentiable!

#### Architecture- Differentiable Neural Computing (DNC)



#### Architecture - Differentiable Neural Computing (DNC)



#### The Controller **N**

The controller can be any neural network architecture (in this case a deep LSTM)

#### Input:

$$X_{t} = [x_{t}; \mathbf{r}_{t-1}^{1}; ...; \mathbf{r}_{t-1}^{R}]$$

**Controller:** 

$$(\mathbf{v}_t, \boldsymbol{\xi}_t) = \mathbf{N}([\mathbf{X}_1; ..., \mathbf{X}_t]; \theta)$$

Output:

$$y_t = v_t + \mathbf{W}_r \left[ r_t^1, ..., r_t^R \right]$$

#### The controller **№** (LSTM)

In each layer,  $l \in \{1,...,L\}$  we compute the following functions:

$$\begin{split} i_{t}^{l} &= \sigma(\mathbf{W}_{i}^{l}[\mathbf{X}_{t}; \mathbf{h}_{t-1}^{l}; \mathbf{h}_{t}^{l-1}] + \mathbf{b}_{i}^{l}) \\ f_{t}^{l} &= \sigma(\mathbf{W}_{f}^{l}[\mathbf{X}_{t}; \mathbf{h}_{t-1}^{l}; \mathbf{h}_{t}^{l-1}] + \mathbf{b}_{f}^{l}) \\ s_{t}^{l} &= f_{t}^{l} s_{t-1}^{l} + i_{t}^{l} \tanh\left(\mathbf{W}_{s}^{l}[\mathbf{X}_{t}; \mathbf{h}_{t-1}^{l}; \mathbf{h}_{t}^{l-1}] + \mathbf{b}_{s}^{l}\right) \\ o_{t}^{l} &= \sigma(\mathbf{W}_{o}^{l}[\mathbf{X}_{t}; \mathbf{h}_{t-1}^{l}; \mathbf{h}_{t}^{l-1}] + \mathbf{b}_{o}^{l}) \\ \mathbf{h}_{t}^{l} &= o_{t}^{l} \tanh(s_{t}^{l}) \end{split}$$

output vector, inference vector

$$egin{aligned} & oldsymbol{v}_t = \mathbf{W}_y igg[ h_t^1 \, ; ... ; h_t^L igg] \ & eta_t = \mathbf{W}_{eta} igg[ h_t^1 \, ; ... ; h_t^L igg] \end{aligned}$$

#### Interface Vector

- The interface vector contains parameters
- Controller uses interface vector to connect to memory for Reading/Writing

$$\xi_{t} = \left[\mathbf{k}_{t}^{\mathbf{r},1}; ...; \mathbf{k}_{t}^{\mathbf{r},R}; \hat{\beta}_{t}^{\mathbf{r},1}; ...; \hat{\beta}_{t}^{\mathbf{r},R}; \mathbf{k}_{t}^{w}; \hat{\beta}_{t}^{w}; \hat{\mathbf{e}}_{t}; \mathbf{v}_{t}; \hat{\mathbf{f}}_{t}^{1}; ...; \hat{\mathbf{f}}_{t}^{R}; \hat{\mathbf{g}}_{t}^{u}; \hat{\mathbf{g}}_{t}^{w}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R}\right]$$

location selection for reading/writing depends on weighting(s)

$$\Delta_N = \left\{ \boldsymbol{\alpha} \in \mathbb{R}^N : \boldsymbol{\alpha}_i \in [0,1], \sum_{i=1}^N \boldsymbol{\alpha}_i \le 1 \right\}$$



The controller interpolates among these mechanisms using scalar gates





temporal memory linkage

 $b_{t}, f_{t}$ 

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1}; ...; \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathbf{r},1}; ...; \mathbf{k}_{t}^{\mathbf{r},R}; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r},1}; ...; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r},R}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$





In writing, location selection depends on single weighting

$$\left|\mathbf{w}_{t}^{\mathrm{w}}\in\Delta_{N}\right|$$

Now, we can write the write vector  $\mathbf{v}_{t} \in \mathbb{R}^{w}$  (after erase)

$$\boldsymbol{M}_{t} = \boldsymbol{M}_{t-1} \circ \left( \boldsymbol{E} - \mathbf{w}_{t}^{w} \mathbf{e}_{t}^{T} \right) + \mathbf{w}_{t}^{w} \mathbf{v}_{t}^{T}$$

$$\boldsymbol{\hat{\boldsymbol{\xi}}_{t}} = \left[ \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; \dots; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\text{r.l.}} \dots; \mathbf{k}_{t}^{\text{r.R.}}; \hat{\boldsymbol{\beta}}_{t}^{\text{r.l.}} \dots; \hat{\boldsymbol{\beta}}_{t}^{\text{r.R.}}; \hat{\boldsymbol{\pi}}_{t}^{1}; \dots; \hat{\boldsymbol{\pi}}_{t}^{R} \right]$$



$$\boldsymbol{\xi}_{t} = \left[\mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1}; ...; \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathrm{r.l}}; ...; \mathbf{k}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.l}}; ...; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R}\right]$$



#### **content lookup** operations on the memory $M \in \mathbb{R}^{N \times M}$

$$C(M, \mathbf{k}, \beta)[i] = \frac{\exp\{D(\mathbf{k}, M[i, .])\beta\}}{\sum_{j} \exp\{D(\mathbf{k}, M[j, .])\beta\}}$$
key strength (scalar)

$$D(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}.\mathbf{v}}{|\mathbf{u}||\mathbf{v}|}$$

content weighting 
$$\mathbf{c}_{t}^{\mathrm{W}} = C(M_{t-1}, \mathbf{k}_{t}^{\mathrm{W}}, \beta_{t}^{\mathrm{W}})$$

$$\boldsymbol{\hat{\boldsymbol{\xi}}_{t}} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; ...; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\text{r.l.}} ...; \mathbf{k}_{t}^{\text{r.R.}}; \hat{\boldsymbol{\beta}}_{t}^{\text{r.l.}} :...; \hat{\boldsymbol{\beta}}_{t}^{\text{r.R.}}; \hat{\boldsymbol{\pi}}_{t}^{\text{l.}} :...; \hat{\boldsymbol{\pi}}_{t}^{\text{R.}} \end{bmatrix}$$



$$\boldsymbol{\xi}_{t} = \left[\mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1}; ...; \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathrm{r.l}}; ...; \mathbf{k}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.l}}; ...; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R}\right]$$

#### dynamic memory allocation

- dynamic memory allocation
- The controller should have a mechanism to free and allocate memory-when needed
- The controller emits free gates  $\hat{f}_t^1$ ;...;  $\hat{f}_t^R$  (one per read head)
- can the most recently read locations be freed?

Aim of dynamic memory allocation:

finding allocation weightings  $a_t$  (history based)

(used to provide new locations for writing)

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1}; \dots; \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathrm{r,1}}; \dots; \mathbf{k}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r,1}}; \dots; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\pi}}_{t}^{1}; \dots; \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$

#### dynamic memory allocation



- free-gates determines if the most recently read locations can be freed if high free gate, so DCN can forget those memories and write new information to those locations
- I) free-gates are used to define memory retention vector
- II) Usage vector, then, is defined as follow

$$\mathbf{u}_{t} = \left(\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^{\mathbf{w}} - \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^{\mathbf{w}}\right) \circ \prod_{i=1}^{R} \left(\mathbf{1} - f_{t}^{i} \mathbf{w}_{t-1}^{r,i}\right)$$

If for a memory cell  $\psi_{t}[i] \approx 0$ , then it have been recently read

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1} \dots \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{\beta}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathrm{r.l}} \dots \mathbf{k}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.l}} \dots \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\pi}}_{t}^{1} \dots \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$

#### dynamic memory allocation



- free-gates determines if the most recently read locations can be freed if high free gate, so DCN can forget those memories and write new information to those locations
- I) free-gates are used to define memory retention vector
- II) Usage vector, then, is defined as follow

$$\mathbf{u}_{t} = \left(\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^{\mathbf{w}} - \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^{\mathbf{w}}\right) \circ \prod_{i=1}^{R} \left(\mathbf{1} - f_{t}^{i} \mathbf{w}_{t-1}^{r,i}\right)$$

if DNC has not recently write to

$$\psi_i$$

if DNC has recently read from

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1} & \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{W}; \hat{\boldsymbol{\beta}}_{t}^{W}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{W}; \mathbf{k}_{t}^{\mathrm{r,l}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r,l}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\pi}}_{t}^{1} & \hat{\boldsymbol{f}}_{t}^{R} \end{bmatrix}$$



Memory usage vector tracks the usage of each memory location

$$\mathbf{u}_{t} = \left(\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^{\mathbf{w}} - \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^{\mathbf{w}}\right) \circ \prod_{i=1}^{R} \left(\mathbf{1} - f_{t}^{i} \mathbf{w}_{t-1}^{r,i}\right)$$

$$\boldsymbol{\psi}_{t}$$

$$\boldsymbol{\phi}_{t} = SortIndicesAscending\left(\mathbf{u}_{t}\right)$$

$$\boldsymbol{\phi}_{t}[1] \text{ is the "least used" index of } \mathbf{u}_{t}$$

Allocation weightings: used to provide new locations for the writing

$$a_{t}\left[\phi_{t}[j]\right] = \left(1 - \left[\phi_{t}[j]\right]\right) \prod_{i=1}^{j-1} \mathbf{u}_{t}[i]$$

If usages are 1, then  $a_r = 0$  and the controller can not longer allocate memory

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1} & \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{W}; \hat{\boldsymbol{\beta}}_{t}^{W}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{W}; \mathbf{k}_{t}^{\mathrm{r,l}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r,l}}; \dots; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\pi}}_{t}^{\mathrm{l}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\pi}}_{t}^{\mathrm{l}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{\pi}}_{t}^{\mathrm{l}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}} & \hat{\boldsymbol{f}}_{t}^{\mathrm{r,R}}; \hat{\boldsymbol{f$$

#### Writing on Memory and Weightings



$$\boldsymbol{M}_{t} = \boldsymbol{M}_{t-1} \circ \left( \boldsymbol{E} - \mathbf{w}_{t}^{w} \mathbf{e}_{t}^{T} \right) + \mathbf{w}_{t}^{w} \mathbf{v}_{t}^{T}$$

$$a_t \left[ \phi_t[j] \right] = \left( 1 - \left[ \phi_t[j] \right] \right) \prod_{i=1}^{j-1} \mathbf{u}_t[i]$$

content weighting

$$\mathbf{c}_{t}^{\mathrm{W}} = C(M_{t-1}, \mathbf{k}_{t}^{\mathrm{W}}, \beta_{t}^{\mathrm{W}})$$

$$\boldsymbol{M}_{t} = \boldsymbol{M}_{t-1} \circ \left( \boldsymbol{E} - \mathbf{w}_{t}^{w} \mathbf{e}_{t}^{T} \right) + \mathbf{w}_{t}^{w} \mathbf{v}_{t}^{T}$$

$$\mathbf{w}_{t}^{\mathbf{w}} = g_{t}^{\mathbf{w}} \left[ g_{t}^{a} \mathbf{a}_{t} + (1 - g_{t}^{a}) \mathbf{c}_{t}^{\mathbf{w}} \right]$$

 $g_t^{\text{W}} \in [0,1]$  is the write gate, if write gate is zero, then nothing is written  $g_t^a \in [0,1]$  the allocation gate governing the interpolation

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{\boldsymbol{f}}_{t}^{1}; \dots \hat{\boldsymbol{f}}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\boldsymbol{g}}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathrm{r.l}}; \dots; \mathbf{k}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.l}}; \dots; \hat{\boldsymbol{\beta}}_{t}^{\mathrm{r.R}}; \hat{\boldsymbol{\pi}}_{t}^{1}; \dots; \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$



$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; ...; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\beta}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathbf{r},1}; ...; \mathbf{k}_{t}^{\mathbf{r},R}; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r},1}; ...; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r},R}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$

#### Architecture- Differentiable Neural Computing (DNC)



Memory: Reading

 $\Delta_N = \left\{ \boldsymbol{\alpha} \in \mathbb{R}^N : \boldsymbol{\alpha}_i \in [0,1], \sum_{i=1}^N \boldsymbol{\alpha}_i \le 1 \right\}$ 

In reading, location select depends on weightings

$$\left\{\mathbf{w}_{t}^{r,1},\ldots,\mathbf{w}_{t}^{r,R}\in\Delta_{N}\right\}$$

Now, we can define the read vectors  $\{\mathbf{r}_t^1,...,\mathbf{r}_t^R\}$ 

$$\mathbf{r}_t^i = \boldsymbol{M}_t^T \mathbf{w}_t^{r,i}$$



## **content** lookup operations on the memory $M \in \mathbb{R}^{N \times M}$ use the following function

$$C(M, \mathbf{k}, \beta)[i] = \frac{\exp\{D(\mathbf{k}, M[i,.])\beta\}}{\sum_{j} \exp\{D(\mathbf{k}, M[j,.])\beta\}}$$

$$D(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}.\mathbf{v}}{|\mathbf{u}||\mathbf{v}|}$$

content weighting 
$$\mathbf{c}_t^{r,i} = C(M_{t-1}, \mathbf{k}_t^{r,i}, \beta_t^{r,i})$$

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; ...; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\beta}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathbf{r}, \mathbf{l}}; ...; \mathbf{k}_{t}^{\mathbf{r}, \mathbf{R}}; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r}, \mathbf{l}}; ...; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r}, \mathbf{R}}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$



$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; ...; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\beta}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathbf{r},1}; ...; \mathbf{k}_{t}^{\mathbf{r},R}; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r},1}; ...; \hat{\boldsymbol{\beta}}_{t}^{\mathbf{r},R}; \hat{\boldsymbol{\pi}}_{t}^{1}; ...; \hat{\boldsymbol{\pi}}_{t}^{R} \end{bmatrix}$$

#### **Temporal Memory Linkage**



## The order of stored/written information is important for reading

e.g. retrieving instructions



temporal memory linkage

$$L_{t} \in [0,1]^{N \times N}$$



location i was written to just after location j in the pervious time step.

$$j_{\mathsf{t-}1}$$







To define  $L_{_{\!\scriptscriptstyle{f}}}$ we need a precedence weighting  $\mathbf{p}_t$ 

$$\mathbf{p}_t = \underbrace{1 - \sum_i \mathbf{w}_t^w[i]}_{i} \mathbf{p}_{t-1} + \underbrace{\mathbf{w}_t^w}_{t}$$
 adds new linages remove the old linkages

 $\mathbf{p}_{i}[i]$  represents the degree to which location i was the last one written to.

DNC updates  $\mathbf{p}_t$  based on how much writing occurred at the current time-step (by  $\mathbf{w}_t^{\mathrm{w}}$ )

if 
$$\mathbf{W}_{t}^{W} \approx 0 \implies \mathbf{p}_{t} \approx \mathbf{p}_{t-1}$$
 (no writing)



precedence weighting  $\mathbf{p}_{t}$  is used to update a temporal link matrix

$$L_{t}[i,j] = (1 - \mathbf{w}_{t}^{w}[i] - \mathbf{w}_{t}^{w}[j])L_{t-1}[i,j] + \mathbf{w}_{t}^{w}[i] \mathbf{p}_{t-1}[j]$$

tracks the order of writings on the memory locations

the write weight of location i at current time

precedence weigh of location j at pervious



$$L_{t}[i,j] = (1 - \mathbf{w}_{t}^{w}[i] - \mathbf{w}_{t}^{w}[j])L_{t-1}[i,j] + \mathbf{w}_{t}^{w}[i] \mathbf{p}_{t-1}[j]$$

 $L_{t}[i,j]$  allowing the controller to iterate forward or backward in time

$$\left[ egin{array}{c} f_t^{\,i} = L_t \mathbf{w}_{ ext{t-1}}^{ ext{r,i}} \end{array} 
ight] \left[ egin{array}{c} b_t^i = L_t^T \mathbf{w}_{ ext{t-1}}^{ ext{r,i}} \end{array} 
ight]$$

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; ...; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\beta}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathbf{r},1}; ...; \mathbf{k}_{t}^{\mathbf{r},R}; \hat{\beta}_{t}^{\mathbf{r},1}; ...; \hat{\beta}_{t}^{\mathbf{r},R}; \hat{\pi}_{t}^{1}; ...; \hat{\pi}_{t}^{R} \end{bmatrix}$$

#### Architecture- Differentiable Neural Computing (DNC)



 $b_t, f_t$ 

# content- based addressing temporal memory linkage

#### content based addressing

$$\mathbf{c}_{t}^{r,i} = C(M_{t-1}, \mathbf{k}_{t}^{r,i}, \beta_{t}^{r,i})$$

### temporal memory linkage

$$L_{t}[i,j] = (1 - \mathbf{w}_{t}^{w}[i] - \mathbf{w}_{t}^{w}[j])L_{t-1}[i,j] + \mathbf{w}_{t}^{w}[i] \mathbf{p}_{t-1}[j]$$

$$f_t^i = L_t \mathbf{w}_{\mathsf{t-1}}^{\mathsf{r},\mathsf{i}} \quad b_t^i = L_t^T \mathbf{w}_{\mathsf{t-1}}^{\mathsf{r},\mathsf{i}}$$

## Reading weight (using three way gates $\pi_t^\iota$ )

$$\mathbf{W}_{t}^{r,i} = \pi_{t}^{i}[1] \mathbf{b}_{t}^{i} + \pi_{t}^{i}[2] \mathbf{c}_{t}^{r,i} + \pi_{t}^{i}[3] f_{t}^{i}$$

$$\boldsymbol{\xi}_{t} = \begin{bmatrix} \mathbf{v}_{t}; \hat{\mathbf{e}}_{t}; \hat{f}_{t}^{1}; ...; \hat{f}_{t}^{R}; \mathbf{k}_{t}^{w}; \hat{\beta}_{t}^{w}; \hat{\mathbf{g}}_{t}^{a}; \hat{\mathbf{g}}_{t}^{w}; \mathbf{k}_{t}^{\mathbf{r},1}; ...; \mathbf{k}_{t}^{\mathbf{r},R}; \hat{\beta}_{t}^{\mathbf{r},1}; ...; \hat{\beta}_{t}^{\mathbf{r},R}; \hat{\pi}_{t}^{1}; ...; \hat{\pi}_{t}^{R} \end{bmatrix}$$

#### **Dynamic Memory Allocation - Test**

- present sequences to the network

 controller's aim is to recreate of the input If could recreate,

then that input is not needed again

Outputs

- many input sequences:
  - but the size of memory is 10;
  - it must erase and write

Goal: to test whether the memory allocation system would be used and free and re-use the locations



#### **DNC-Test on Graphs**

DNC is tested on answer questions from about the graphs

- I) London Underground graph
  - DNC asked single navigation questions



The Goal was not solve the problems in the graphs,
The goals was to show either DNC can create and navigate throughout the graph.

#### **DNC-Test on Graphs**

DNC is tested on answer questions from about the graphs

II) Family graph: the relation between people has been asked which has not been shown in the training time.



The goal was not solve the problems in the graphs,
The goals was to show either DNC can create and navigate throughout the graph.

#### **DNC-Test on Graphs**

DNC is tested on answer questions from about the graphs

II) family graph which the relation between people has been asked which has not been shown in the training time.

https://www.youtube.com/watch?v=BgU8sI7TcMY



The goal was **not** solve the problems in the graphs,
The goals was to show either DNC can create and navigate throughout the graph.

# Thank You!